DOMAIN: Industrial safety. NLP-based chatbot.
CONTEXT:
The database comes from one of the biggest industrial companies in Brazil, and indeed in the world. There is an urgent need for industries/companies around the globe to understand why employees still suffer injuries and accidents in plants, and sometimes even die in such environments.
DATA DESCRIPTION:
The database consists of accident records from 12 different plants in 3 different countries; every line in the data is one accident occurrence.
Columns description:
Link to download the dataset: https://www.kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database [ for your reference only ]
#!pip install contractions seaborn
!pip install wordcloud
import os
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS
import random as python_random
from gensim.models import Word2Vec
from tqdm import tqdm
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, precision_score, classification_report, precision_recall_fscore_support, make_scorer
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import get_logger
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Flatten, Activation, Dense, LSTM, BatchNormalization, Embedding, Dropout, Bidirectional, GlobalMaxPool1D, Conv1D, MaxPooling1D
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.random import set_seed
# import lightgbm as lgb
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, Callback
from keras.layers import Input
from keras.constraints import unit_norm
from keras.regularizers import l2
import missingno as mno
import holidays
from string import punctuation
import warnings
warnings.filterwarnings('ignore')
import contractions
import pickle
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet, brown
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
from google.colab import drive, files
from imblearn.over_sampling import RandomOverSampler
from sklearn import preprocessing
from tensorflow.keras.backend import clear_session
from tensorflow.keras.models import load_model
import joblib
from google.colab import drive
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/Capstone project")
!ls
Volume in drive D is DATA
Volume Serial Number is 7042-8DAE
Directory of D:\Prijesh\study\GreatLearningAIML\047-capstone-project\final
24-07-2022 07:40 PM <DIR> .
24-07-2022 07:40 PM <DIR> ..
24-07-2022 07:40 PM <DIR> .ipynb_checkpoints
24-07-2022 07:40 PM 6,993,322 Group13_NLP2_July21A_Capstone_Project (3).ipynb
10-06-2022 12:31 PM 35,695 IHMStefanini_industrial_safety_and_health_database.csv
10-06-2022 12:31 PM 193,631 IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv
3 File(s) 7,222,648 bytes
3 Dir(s) 20,169,076,736 bytes free
dataset1 = pd.read_csv('IHMStefanini_industrial_safety_and_health_database.csv')
dataset2 = pd.read_csv('IHMStefanini_industrial_safety_and_health_database_with_accidents_description.csv')
dataset1.head()
| | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee ou Terceiro | Risco Critico |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed |
| 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems |
| 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools |
| 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others |
| 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others |
dataset2.head()
| | Unnamed: 0 | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee or Third Party | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |
dataset1.columns
Index(['Data', 'Countries', 'Local', 'Industry Sector', 'Accident Level',
'Potential Accident Level', 'Genre', 'Employee ou Terceiro',
'Risco Critico'],
dtype='object')
dataset2.columns
Index(['Unnamed: 0', 'Data', 'Countries', 'Local', 'Industry Sector',
'Accident Level', 'Potential Accident Level', 'Genre',
'Employee or Third Party', 'Critical Risk', 'Description'],
dtype='object')
dataset1.shape, dataset2.shape
((439, 9), (425, 11))
We can conclude that dataset1 has 439 records and 9 columns, while dataset2 has 425 records and 11 columns.
Since this is an NLP problem and the Description field is mandatory, we can proceed with dataset2. We finalize dataset2 for further processing by assigning it to a variable called df.
df = dataset2.copy()
df.dtypes
Unnamed: 0                   int64
Data                        object
Countries                   object
Local                       object
Industry Sector             object
Accident Level              object
Potential Accident Level    object
Genre                       object
Employee or Third Party     object
Critical Risk               object
Description                 object
dtype: object
As we already mentioned, we can remove the Unnamed: 0 column as it contains only the index values.
df.drop("Unnamed: 0", axis=1, inplace=True)
Let's rename the columns with meaningful names.
df.rename(columns={'Data':'Date', 'Countries':'Country', 'Genre':'Gender', 'Employee or Third Party':'Employee type'}, inplace=True)
df.head(3)
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
Let's check for duplicate rows in the dataset.
df.duplicated().sum()
7
duplicates = df.duplicated()
df[duplicates]
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 2016-04-01 00:00:00 | Country_01 | Local_01 | Mining | I | V | Male | Third Party (Remote) | Others | In circumstances that two workers of the Abrat... |
| 262 | 2016-12-01 00:00:00 | Country_01 | Local_03 | Mining | I | IV | Male | Employee | Others | During the activity of chuteo of ore in hopper... |
| 303 | 2017-01-21 00:00:00 | Country_02 | Local_02 | Mining | I | I | Male | Third Party (Remote) | Others | Employees engaged in the removal of material f... |
| 345 | 2017-03-02 00:00:00 | Country_03 | Local_10 | Others | I | I | Male | Third Party | Venomous Animals | On 02/03/17 during the soil sampling in the re... |
| 346 | 2017-03-02 00:00:00 | Country_03 | Local_10 | Others | I | I | Male | Third Party | Venomous Animals | On 02/03/17 during the soil sampling in the re... |
| 355 | 2017-03-15 00:00:00 | Country_03 | Local_10 | Others | I | I | Male | Third Party | Venomous Animals | Team of the VMS Project performed soil collect... |
| 397 | 2017-05-23 00:00:00 | Country_01 | Local_04 | Mining | I | IV | Male | Third Party | Projection of fragments | In moments when the 02 collaborators carried o... |
Let's drop the duplicate rows from the dataset.
df.drop_duplicates(inplace=True)
df.shape
(418, 10)
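As a quick sanity check on the arithmetic above (425 rows minus 7 duplicates leaves 418), here is a minimal sketch on a hypothetical toy frame showing how duplicated() and drop_duplicates() interact: duplicated() flags every occurrence after the first, and drop_duplicates() keeps only the first.

```python
import pandas as pd

# Toy frame (invented data): rows 1 and 3 repeat rows 0 and 2
toy = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': ['x', 'x', 'y', 'y', 'z']})
assert toy.duplicated().sum() == 2     # two repeats flagged
deduped = toy.drop_duplicates()
assert deduped.shape == (3, 2)         # 5 rows - 2 duplicates = 3 rows
```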
Let's print the unique values of each field in the dataset (excluding the Description field).
for x in df.columns:
    if x != 'Description':
        print('--'*30); print(f'Unique values of "{x}" column'); print('--'*30)
        print(df[x].unique())
        print('\n')
------------------------------------------------------------
Unique values of "Date" column
------------------------------------------------------------
['2016-01-01 00:00:00' '2016-01-02 00:00:00' '2016-01-06 00:00:00' ...
 '2017-07-05 00:00:00' '2017-07-06 00:00:00' '2017-07-09 00:00:00']
(full list of daily timestamps from 2016-01-01 through 2017-07-09 abbreviated)

------------------------------------------------------------
Unique values of "Country" column
------------------------------------------------------------
['Country_01' 'Country_02' 'Country_03']

------------------------------------------------------------
Unique values of "Local" column
------------------------------------------------------------
['Local_01' 'Local_02' 'Local_03' 'Local_04' 'Local_05' 'Local_06'
 'Local_07' 'Local_08' 'Local_10' 'Local_09' 'Local_11' 'Local_12']

------------------------------------------------------------
Unique values of "Industry Sector" column
------------------------------------------------------------
['Mining' 'Metals' 'Others']

------------------------------------------------------------
Unique values of "Accident Level" column
------------------------------------------------------------
['I' 'IV' 'III' 'II' 'V']

------------------------------------------------------------
Unique values of "Potential Accident Level" column
------------------------------------------------------------
['IV' 'III' 'I' 'II' 'V' 'VI']

------------------------------------------------------------
Unique values of "Gender" column
------------------------------------------------------------
['Male' 'Female']

------------------------------------------------------------
Unique values of "Employee type" column
------------------------------------------------------------
['Third Party' 'Employee' 'Third Party (Remote)']

------------------------------------------------------------
Unique values of "Critical Risk" column
------------------------------------------------------------
['Pressed' 'Pressurized Systems' 'Manual Tools' 'Others'
 'Fall prevention (same level)' 'Chemical substances' 'Liquid Metal'
 'Electrical installation' 'Confined space'
 'Pressurized Systems / Chemical Substances'
 'Blocking and isolation of energies' 'Suspended Loads' 'Poll' 'Cut' 'Fall'
 'Bees' 'Fall prevention' '\nNot applicable' 'Traffic' 'Projection'
 'Venomous Animals' 'Plates' 'Projection/Burning' 'remains of choco'
 'Vehicles and Mobile Equipment' 'Projection/Choco' 'Machine Protection'
 'Power lock' 'Burn' 'Projection/Manual Tools'
 'Individual protection equipment' 'Electrical Shock'
 'Projection of fragments']
Let's analyze the missing values in the dataset.
df.isnull().sum()
Date                        0
Country                     0
Local                       0
Industry Sector             0
Accident Level              0
Potential Accident Level    0
Gender                      0
Employee type               0
Critical Risk               0
Description                 0
dtype: int64
mno.matrix(df, figsize = (10, 4));
Observation: the dataset has no missing values.
def month2seasons(x):
    # Southern Hemisphere seasons (the plants are in South America),
    # hence December-February maps to Summer.
    if x in [9, 10, 11]:
        season = 'Spring'
    elif x in [12, 1, 2]:
        season = 'Summer'
    elif x in [3, 4, 5]:
        season = 'Autumn'
    elif x in [6, 7, 8]:
        season = 'Winter'
    return season
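As a sanity check, the branching above can be flattened into a month-to-season lookup table; note the Southern Hemisphere convention (the plants are in South America), which is why December-February maps to Summer. The dict below is only an illustrative restatement, not part of the pipeline.

```python
# Flatten the if/elif chain into a month -> season lookup (illustrative only)
season_of = {m: s for s, months in
             {'Spring': [9, 10, 11], 'Summer': [12, 1, 2],
              'Autumn': [3, 4, 5], 'Winter': [6, 7, 8]}.items()
             for m in months}

assert season_of[1] == 'Summer'    # January is summer south of the equator
assert season_of[7] == 'Winter'
assert len(season_of) == 12        # every month is covered
```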
def preprocess_data(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df['Date'].apply(lambda x: x.year)
    df['Month'] = df['Date'].apply(lambda x: x.month)
    df['Day'] = df['Date'].apply(lambda x: x.day)
    df['Weekday'] = df['Date'].apply(lambda x: x.day_name())
    df['WeekofYear'] = df['Date'].apply(lambda x: x.weekofyear)
    df['Season'] = df['Month'].apply(month2seasons)
    # brazil_holidays is built in a later cell, before this function is called
    df['Is_Holiday'] = [1 if str(val).split()[0] in brazil_holidays else 0 for val in df['Date']]
    return df
def print_brazil_holidays(year):
    print('--'*40); print('List of Brazil holidays in ' + str(year)); print('--'*40)
    for date in holidays.Brazil(years=year).items():
        print(date)

def get_brazil_holidays(years):
    brazil_holidays = []
    for date in holidays.Brazil(years=years).items():
        brazil_holidays.append(str(date[0]))
    return brazil_holidays
print_brazil_holidays(2016)
print_brazil_holidays(2017)
--------------------------------------------------------------------------------
List of Brazil holidays in 2016
--------------------------------------------------------------------------------
(datetime.date(2016, 1, 1), 'Ano novo')
(datetime.date(2016, 4, 21), 'Tiradentes')
(datetime.date(2016, 5, 1), 'Dia Mundial do Trabalho')
(datetime.date(2016, 9, 7), 'Independência do Brasil')
(datetime.date(2016, 10, 12), 'Nossa Senhora Aparecida')
(datetime.date(2016, 11, 2), 'Finados')
(datetime.date(2016, 11, 15), 'Proclamação da República')
(datetime.date(2016, 12, 25), 'Natal')
(datetime.date(2016, 3, 25), 'Sexta-feira Santa')
(datetime.date(2016, 3, 27), 'Páscoa')
(datetime.date(2016, 5, 26), 'Corpus Christi')
(datetime.date(2016, 2, 10), 'Quarta-feira de cinzas (Início da Quaresma)')
(datetime.date(2016, 2, 9), 'Carnaval')
--------------------------------------------------------------------------------
List of Brazil holidays in 2017
--------------------------------------------------------------------------------
(datetime.date(2017, 1, 1), 'Ano novo')
(datetime.date(2017, 4, 21), 'Tiradentes')
(datetime.date(2017, 5, 1), 'Dia Mundial do Trabalho')
(datetime.date(2017, 9, 7), 'Independência do Brasil')
(datetime.date(2017, 10, 12), 'Nossa Senhora Aparecida')
(datetime.date(2017, 11, 2), 'Finados')
(datetime.date(2017, 11, 15), 'Proclamação da República')
(datetime.date(2017, 12, 25), 'Natal')
(datetime.date(2017, 4, 14), 'Sexta-feira Santa')
(datetime.date(2017, 4, 16), 'Páscoa')
(datetime.date(2017, 6, 15), 'Corpus Christi')
(datetime.date(2017, 3, 1), 'Quarta-feira de cinzas (Início da Quaresma)')
(datetime.date(2017, 2, 28), 'Carnaval')
brazil_holidays = get_brazil_holidays([2016, 2017])
brazil_holidays
['2016-01-01', '2016-04-21', '2016-05-01', '2016-09-07', '2016-10-12', '2016-11-02', '2016-11-15', '2016-12-25', '2016-03-25', '2016-03-27', '2016-05-26', '2016-02-10', '2016-02-09', '2017-01-01', '2017-04-21', '2017-05-01', '2017-09-07', '2017-10-12', '2017-11-02', '2017-11-15', '2017-12-25', '2017-04-14', '2017-04-16', '2017-06-15', '2017-03-01', '2017-02-28']
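The Is_Holiday flag built in preprocess_data() is derived from this list: a row counts as a holiday when the 'YYYY-MM-DD' prefix of its Date string appears among the holiday dates. A minimal sketch, using an abbreviated sample of the list (a set makes the membership test O(1) per row):

```python
# Abbreviated, hand-picked subset of brazil_holidays for illustration
holiday_set = {'2016-01-01', '2016-04-21'}

dates = ['2016-01-01 00:00:00', '2016-01-02 00:00:00']
# Same comprehension shape as in preprocess_data()
is_holiday = [1 if d.split()[0] in holiday_set else 0 for d in dates]
assert is_holiday == [1, 0]
```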
df = preprocess_data(df)
df.head(3)
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | WeekofYear | Season | Is_Holiday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | 53 | Summer | 1 |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | 53 | Summer | 0 |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday | 1 | Summer | 0 |
Reusable functions
def univariate_analysis(col, df, height=600, width=900):
    fig = make_subplots(rows=1, cols=2, specs=[[{"type": "xy"}, {"type": "domain"}]])
    labels = df[col].value_counts().index
    values = df[col].value_counts().values
    colors = px.colors.qualitative.Plotly + px.colors.qualitative.D3 + px.colors.qualitative.Vivid
    fig.add_trace(go.Bar(x=labels, y=values, name=col, marker=dict(color=colors), showlegend=False), row=1, col=1)
    fig.add_trace(go.Pie(labels=labels, values=values, name=col, marker=dict(colors=colors)), row=1, col=2)
    fig.update_layout(height=height, width=width, legend=dict(title=col))
    fig.show()
columns = df.drop(columns=['Date', 'Description', 'WeekofYear', 'Day']).columns
for col in columns:
    univariate_analysis(col, df)
def bivariate_analysis(df, col, hue):
    fig = px.histogram(df, x=df[col], color=hue, width=800, height=400, title=f'{hue} vs {col} analysis')
    fig.show()
hue = 'Gender'
columns = ['Employee type', 'Country', 'Industry Sector', 'Is_Holiday', 'Accident Level', 'Weekday', 'Year', 'Season']
for col in columns:
    bivariate_analysis(df, col, hue)
hue = 'Accident Level'
columns = ['Employee type', 'Country', 'Industry Sector', 'Is_Holiday', 'Gender', 'Weekday', 'Year', 'Season', 'Local']
for col in columns:
    bivariate_analysis(df, col, hue)
def pre_process_for_ml(df):
    df['Country'] = df['Country'].replace({'Country_01': 1, 'Country_02': 2, 'Country_03': 3})
    df['Local'] = df['Local'].replace({'Local_01': 1, 'Local_02': 2, 'Local_03': 3, 'Local_04': 4,
                                       'Local_05': 5, 'Local_06': 6, 'Local_07': 7, 'Local_08': 8,
                                       'Local_09': 9, 'Local_10': 10, 'Local_11': 11, 'Local_12': 12})
    df['Industry Sector'] = df['Industry Sector'].replace({'Mining': 1, 'Metals': 2, 'Others': 3})
    df['Gender'] = df['Gender'].replace({'Male': 1, 'Female': 2})
    df['Employee type'] = df['Employee type'].replace({'Third Party': 1, 'Employee': 2, 'Third Party (Remote)': 3})
    df['Critical Risk'] = LabelEncoder().fit_transform(df['Critical Risk'])
    # Year is numeric after preprocess_data(), so the keys must be ints, not strings
    df['Year'] = df['Year'].replace({2016: 1, 2017: 2})
    df['Weekday'] = df['Weekday'].replace({'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
                                           'Friday': 5, 'Saturday': 6, 'Sunday': 7})
    df['Season'] = df['Season'].replace({'Summer': 1, 'Autumn': 2, 'Winter': 3, 'Spring': 4})
    df['Accident Level'] = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6})
    df['Potential Accident Level'] = df['Potential Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6})
    return df
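A note on the encoding style used above: replace() with a fixed dict gives a stable, self-documenting category-to-integer mapping, whereas a freshly fit LabelEncoder assigns codes alphabetically and must be persisted to reproduce. A toy illustration on a hypothetical mini-frame:

```python
import pandas as pd

# Invented mini-frame showing the replace-based ordinal encoding
toy = pd.DataFrame({'Accident Level': ['I', 'IV', 'II']})
roman = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6}
toy['Accident Level'] = toy['Accident Level'].replace(roman)
assert toy['Accident Level'].tolist() == [1, 4, 2]
```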
new_df = pre_process_for_ml(df.copy())
plt.figure(figsize=(15,6))
sns.heatmap(new_df.corr(), annot=True, cmap='Blues')
plt.show()
df.describe().T.style.bar(
subset=['mean'],
color='Reds').background_gradient(
subset=['std'], cmap='ocean').background_gradient(subset=['50%'], cmap='PuBu')
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 418.000000 | 2016.322967 | 0.468170 | 2016.000000 | 2016.000000 | 2016.000000 | 2017.000000 | 2017.000000 |
| Month | 418.000000 | 5.267943 | 3.186449 | 1.000000 | 3.000000 | 5.000000 | 7.000000 | 12.000000 |
| Day | 418.000000 | 15.076555 | 8.618416 | 1.000000 | 8.000000 | 15.000000 | 22.000000 | 31.000000 |
| WeekofYear | 418.000000 | 21.033493 | 13.998418 | 1.000000 | 9.000000 | 18.000000 | 30.000000 | 53.000000 |
| Is_Holiday | 418.000000 | 0.023923 | 0.152994 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
def preprocess_text(text):
    # Expand contractions
    text = contractions.fix(text)
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove HTML tags and entities, if any
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    text = re.sub(html, "", text)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]', "", text)
    # Remove emojis
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # Remove all remaining special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove unnecessary spaces
    text = text.strip()
    tokens = word_tokenize(text)
    # Drop English stopwords
    word_list = [w for w in tokens if w not in stopwords.words('english')]
    return word_list
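For intuition, the regex steps above can be exercised in isolation on a made-up sentence, without the contractions/NLTK dependencies; the mini stopword set below merely stands in for stopwords.words('english').

```python
import re

# Made-up sentence exercising URL, non-ASCII, and punctuation removal
text = "Worker slipped near the pump! See https://example.com/report \u00e9"
text = re.sub(r"https?://\S+|www\.\S+", "", text)     # strip URLs
text = re.sub(r"[^\x00-\x7f]", "", text)              # strip non-ASCII
text = re.sub(r"[^a-zA-Z0-9\s]", "", text)            # strip punctuation
tokens = text.lower().strip().split()
mini_stopwords = {"the", "near", "see"}               # stand-in stopword list
tokens = [t for t in tokens if t not in mini_stopwords]
assert tokens == ["worker", "slipped", "pump"]
```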
def stem_text(text):
    stemmer = PorterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems
def extract_pos_tags(text):
    words = word_tokenize(text)
    tagged = nltk.pos_tag(words)
    return tagged
wordnet_map = {
"N":wordnet.NOUN,
"V":wordnet.VERB,
"J":wordnet.ADJ,
"R":wordnet.ADV
}
train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
def extract_pos_tags(text, pos_tag_type="pos_tag"):
    # Re-definition: this bigram/unigram backoff tagger replaces the simpler
    # nltk.pos_tag version defined above.
    pos_tagged_text = t2.tag(text)
    pos_tagged_text = [(word, wordnet_map.get(pos_tag[0])) if pos_tag[0] in wordnet_map.keys()
                       else (word, wordnet.NOUN) for (word, pos_tag) in pos_tagged_text]
    return pos_tagged_text
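Conceptually, the t2 → t1 → t0 chain above means each tagger answers only for contexts it has seen in the Brown training sentences and defers to its backoff otherwise, with DefaultTagger('NN') as the final fallback. A toy sketch of that fallback behaviour (the lookup table is invented for illustration):

```python
# Toy single-level backoff: a unigram lookup with a default-tag fallback
unigram = {'drill': 'NN', 'removing': 'VBG', 'supervisor': 'NN'}

def tag(word):
    # Unknown words fall back to 'NN', mirroring nltk.DefaultTagger('NN')
    return unigram.get(word, 'NN')

assert tag('removing') == 'VBG'
assert tag('totally-unseen-word') == 'NN'
```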
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    lemma = [lemmatizer.lemmatize(word, tag) for word, tag in text]
    return lemma
df['description_processed'] = df['Description'].apply(lambda t: ' '.join(preprocess_text(t)))
df['description_processed_stemmed'] = df['Description'].apply(lambda t: preprocess_text(t)).apply(lambda t: ' '.join(stem_text(t)))
df['description_processed_lemmatized'] = df['Description'].apply(lambda t: preprocess_text(t)).apply(lambda t: extract_pos_tags(t)).apply(lambda t: ' '.join(lemmatize_text(t)))
df.head(3)
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | WeekofYear | Season | Is_Holiday | description_processed | description_processed_stemmed | description_processed_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | 53 | Summer | 1 | removing drill rod jumbo 08 maintenance superv... | remov drill rod jumbo 08 mainten supervisor pr... | removing drill rod jumbo 08 maintenance superv... |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | 53 | Summer | 0 | activation sodium sulphide pump piping uncoupl... | activ sodium sulphid pump pipe uncoupl sulfid ... | activation sodium sulphide pump pip uncoupled ... |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday | 1 | Summer | 0 | substation milpo located level 170 collaborato... | substat milpo locat level 170 collabor excav w... | substation milpo locate level 170 collaborator... |
def draw_wordcloud(df, col, bigrams=True):
    text = " ".join(i for i in df[col])
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(stopwords=stopwords, background_color="whitesmoke", collocations=bigrams).generate(text)
    plt.figure(figsize=(15, 10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
draw_wordcloud(df, 'description_processed_lemmatized', bigrams=True)
Here we can see the most frequently used words in the description_processed_lemmatized field with bigrams enabled (collocations=True), so common two-word phrases also appear.
draw_wordcloud(df, 'description_processed_lemmatized', bigrams=False)
Here we can see the most frequently used words in the description_processed_lemmatized field with bigrams disabled (collocations=False), so only individual words appear.
df.head(3)
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | WeekofYear | Season | Is_Holiday | description_processed | description_processed_stemmed | description_processed_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | 53 | Summer | 1 | removing drill rod jumbo 08 maintenance superv... | remov drill rod jumbo 08 mainten supervisor pr... | removing drill rod jumbo 08 maintenance superv... |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | 53 | Summer | 0 | activation sodium sulphide pump piping uncoupl... | activ sodium sulphid pump pipe uncoupl sulfid ... | activation sodium sulphide pump pip uncoupled ... |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday | 1 | Summer | 0 | substation milpo located level 170 collaborato... | substat milpo locat level 170 collabor excav w... | substation milpo locate level 170 collaborator... |
df.to_csv('Accident_data_cleansed.csv', index=False)
files.download('Accident_data_cleansed.csv')
df = pd.read_csv('Accident_data_cleansed.csv')
df.head(2)
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | WeekofYear | Season | Is_Holiday | description_processed | description_processed_stemmed | description_processed_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | 53 | Summer | 1 | removing drill rod jumbo 08 maintenance superv... | remov drill rod jumbo 08 mainten supervisor pr... | removing drill rod jumbo 08 maintenance superv... |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | 53 | Summer | 0 | activation sodium sulphide pump piping uncoupl... | activ sodium sulphid pump pipe uncoupl sulfid ... | activation sodium sulphide pump pip uncoupled ... |
univariate_analysis('Accident Level', df)
Here we can see that the dataset is imbalanced when the Accident Level field is considered as the target variable.
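The imbalance can be quantified with `value_counts(normalize=True)`. A minimal sketch using a toy label series (hypothetical counts, standing in for the real `Accident Level` column):

```python
import pandas as pd

# Toy stand-in for the 'Accident Level' column (hypothetical counts)
labels = pd.Series(['I'] * 8 + ['II'] * 2 + ['V'] * 1)

# Relative class frequencies make the imbalance obvious at a glance
dist = labels.value_counts(normalize=True)
print(dist['I'])  # the majority class dominates: 8/11
```

On the real data the same one-liner shows class 'I' dwarfing the other levels.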
def get_vocabularies(X):
    vocabulary = set()
    for description in X:
        words = word_tokenize(description)
        vocabulary.update(words)
    vocabulary = list(vocabulary)
    return vocabulary
X = df['description_processed_lemmatized']
y = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5})
print(X.shape, y.shape)
(418,) (418,)
vocabulary = get_vocabularies(X.values)
print(len(vocabulary))
2975
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=7)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(334,) (334,) (84,) (84,)
stop_words = stopwords.words('english') + list(punctuation)
vectorizer = TfidfVectorizer(stop_words=stop_words, tokenizer=word_tokenize, vocabulary=vocabulary)
vectorizer.fit(X_train)
TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
'ourselves', 'you', "you're", "you've", "you'll",
"you'd", 'your', 'yours', 'yourself', 'yourselves',
'he', 'him', 'his', 'himself', 'she', "she's",
'her', 'hers', 'herself', 'it', "it's", 'its',
'itself', ...],
tokenizer=<function word_tokenize at 0x000001D0FD7E8940>,
vocabulary=['crown', 'pivot', 'purification', 'ordinary',
'bolt', 'applies', 'elevation', 'branch',
'assembling', 'roger', 'shaft', 'new', 'knuckle',
'respective', 'technical', 'a1', 'chagua', 'spare',
'spin', 'virdro', 'isidro', 'melting', 'thrust',
'ob1', 'mx12', 'teammate', 'sodium', 'sudden',
'scooptram', 'mount', ...])
X_train_vec = vectorizer.transform(X_train)
X_test_vec = vectorizer.transform(X_test)
print(X_train_vec.shape, X_test_vec.shape)
(334, 2975) (84, 2975)
X_o = df[['description_processed_lemmatized']]
y_o = df['Accident Level'].replace({'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5})
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(X_o, y_o, test_size=0.2, random_state=7)
random_over = RandomOverSampler()
print("Before UpSampling, counts of label 'I': {}".format(sum(y_train_o==1)))
print("Before UpSampling, counts of label 'II': {}".format(sum(y_train_o==2)))
print("Before UpSampling, counts of label 'III': {}".format(sum(y_train_o==3)))
print("Before UpSampling, counts of label 'IV': {}".format(sum(y_train_o==4)))
print("Before UpSampling, counts of label 'V': {}".format(sum(y_train_o==5)))
Before UpSampling, counts of label 'I': 243
Before UpSampling, counts of label 'II': 32
Before UpSampling, counts of label 'III': 23
Before UpSampling, counts of label 'IV': 29
Before UpSampling, counts of label 'V': 7
X_train_o, y_train_o = random_over.fit_resample(X_train_o, y_train_o.ravel())
print("After UpSampling, counts of label 'I': {}".format(sum(y_train_o==1)))
print("After UpSampling, counts of label 'II': {}".format(sum(y_train_o==2)))
print("After UpSampling, counts of label 'III': {}".format(sum(y_train_o==3)))
print("After UpSampling, counts of label 'IV': {}".format(sum(y_train_o==4)))
print("After UpSampling, counts of label 'V': {}".format(sum(y_train_o==5)))
After UpSampling, counts of label 'I': 243
After UpSampling, counts of label 'II': 243
After UpSampling, counts of label 'III': 243
After UpSampling, counts of label 'IV': 243
After UpSampling, counts of label 'V': 243
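`RandomOverSampler` simply replicates minority-class rows (sampling with replacement) until every class matches the majority count. The same effect can be sketched in plain NumPy (toy data, hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(8).reshape(-1, 1)          # toy feature column
y = np.array([1, 1, 1, 1, 1, 1, 2, 2])   # imbalanced labels: 6 vs 2

# Re-draw indices per class (with replacement) up to the majority count,
# mimicking what RandomOverSampler does internally
target = max((y == c).sum() for c in np.unique(y))
idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=target, replace=True)
    for c in np.unique(y)
])
X_res, y_res = X[idx], y[idx]
print({int(c): int((y_res == c).sum()) for c in np.unique(y)})  # {1: 6, 2: 6}
```

Because rows are duplicated verbatim, oversampling is done only on the training split, exactly as above; duplicating before the split would leak copies of training rows into the test set.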
X_train_o_vect = vectorizer.transform(X_train_o['description_processed_lemmatized'].values)
X_test_o_vect = vectorizer.transform(X_test_o['description_processed_lemmatized'].values)
def get_ml_model_results(X_train, y_train, X_test, y_test):
    models = {
        'Multinomial NB': MultinomialNB(),
        'Logistic Regression': LogisticRegression(),
        'Gaussian NB': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'SVM': SVC(),
        'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=50, min_samples_leaf=7),
        'Random Forest': RandomForestClassifier(n_estimators=50, max_samples=7),
        'Bagging': BaggingClassifier(n_estimators=100, max_samples=10),
        'Ada Boost': AdaBoostClassifier(n_estimators=100),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, learning_rate=0.05),
        'RidgeClassifier': RidgeClassifier(random_state=1),
    }
    names = []
    prediction = []
    train_scores = []
    test_scores = []
    precision_scores = []
    recall_scores = []
    f1_scores = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        ps, rs, fs, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        names.append(name)
        prediction.append(y_pred)
        train_scores.append(round(train_score * 100, 2))
        test_scores.append(round(test_score * 100, 2))
        precision_scores.append(round(ps * 100, 2))
        recall_scores.append(round(rs * 100, 2))
        f1_scores.append(round(fs * 100, 2))
    results = pd.DataFrame({
        'Model': names,
        'Train Accuracy': train_scores,
        'Test Accuracy': test_scores,
        'Precision': precision_scores,
        'Recall': recall_scores,
        'F1': f1_scores
    })
    return results
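`precision_recall_fscore_support(..., average='weighted')`, used above, averages the per-class scores weighted by class support, which is why precision can differ sharply from accuracy on imbalanced data. A small worked sketch (toy labels, hypothetical):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 3]

# Per-class scores are averaged with weights proportional to class support
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print(round(p, 2), round(r, 2), round(f, 2))  # 0.89 0.83 0.83
```

With `average='weighted'`, the dominant class drives the aggregate, so a model that only predicts level I can still post a respectable recall.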
def get_ml_model(model, X_train, y_train):
    models = {
        'Multinomial NB': MultinomialNB(),
        'Logistic Regression': LogisticRegression(),
        'Gaussian NB': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'SVM': SVC(),
        'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=50, min_samples_leaf=7),
        'Random Forest': RandomForestClassifier(n_estimators=50, max_samples=7),
        'Bagging': BaggingClassifier(n_estimators=100, max_samples=10),
        'Ada Boost': AdaBoostClassifier(n_estimators=100),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, learning_rate=0.05),
        'RidgeClassifier': RidgeClassifier(random_state=1)
    }
    # .get avoids a KeyError for an unknown model name
    selected_model = models.get(model)
    if selected_model is None:
        print('Model not available')
        return None
    selected_model.fit(X_train, y_train)
    return selected_model
def visualize_model_score(result_df, col):
    result_df = result_df.sort_values(by=col, ascending=False)
    fig = px.bar(result_df, x='Model', y=col, color=col,
                 width=1200, height=500, text=[f'{str(i)} %' for i in result_df[col]],
                 color_continuous_scale='blugrn')
    fig.update_layout(title=f'{col} score of Models', title_x=0.5)
    fig.update_yaxes(range=[0, 100])
    fig.show()
model_results = get_ml_model_results(X_train_vec.todense(), y_train, X_test_vec.todense(), y_test)
model_results.sort_values(by='Test Accuracy', ascending=False)
| | Model | Train Accuracy | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 0 | Multinomial NB | 72.75 | 78.57 | 61.73 | 78.57 | 69.14 |
| 1 | Logistic Regression | 72.75 | 78.57 | 61.73 | 78.57 | 69.14 |
| 2 | Gaussian NB | 99.40 | 78.57 | 61.73 | 78.57 | 69.14 |
| 4 | SVM | 79.64 | 78.57 | 61.73 | 78.57 | 69.14 |
| 6 | Random Forest | 72.75 | 78.57 | 61.73 | 78.57 | 69.14 |
| 7 | Bagging | 72.75 | 78.57 | 61.73 | 78.57 | 69.14 |
| 10 | RidgeClassifier | 97.90 | 78.57 | 61.73 | 78.57 | 69.14 |
| 3 | KNN | 73.95 | 76.19 | 62.86 | 76.19 | 68.88 |
| 8 | Ada Boost | 74.25 | 76.19 | 61.32 | 76.19 | 67.95 |
| 9 | Gradient Boosting | 99.40 | 76.19 | 62.08 | 76.19 | 68.42 |
| 5 | Decision Tree | 76.35 | 71.43 | 61.22 | 71.43 | 65.93 |
visualize_model_score(model_results, 'Train Accuracy')
visualize_model_score(model_results, 'Test Accuracy')
visualize_model_score(model_results, 'Precision')
visualize_model_score(model_results, 'Recall')
visualize_model_score(model_results, 'F1')
best_ml_model = get_ml_model('Multinomial NB', X_train_vec.todense(), y_train)
I = df[df['Accident Level']=='I']
II = df[df['Accident Level']=='II']
III = df[df['Accident Level']=='III']
IV = df[df['Accident Level']=='IV']
V = df[df['Accident Level']=='V']
I_input = I['Description'].iloc[10]
best_ml_model.predict(vectorizer.transform([I_input]))
array([1], dtype=int64)
II_input = II['Description'].iloc[10]
best_ml_model.predict(vectorizer.transform([II_input]))
array([1], dtype=int64)
Since we have used imbalanced data for creating the models, the best performing model is not predicting correctly: a level-II description is classified as level I.
model_results_o = get_ml_model_results(X_train_o_vect.todense(), y_train_o, X_test_o_vect.todense(), y_test_o)
model_results_o.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
| | Model | Train Accuracy | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 2 | Gaussian NB | 99.26 | 78.57 | 61.73 | 78.57 | 69.14 |
| 4 | SVM | 99.18 | 78.57 | 61.73 | 78.57 | 69.14 |
| 1 | Logistic Regression | 99.18 | 76.19 | 62.08 | 76.19 | 68.42 |
| 10 | RidgeClassifier | 99.18 | 76.19 | 67.42 | 76.19 | 70.18 |
| 9 | Gradient Boosting | 99.09 | 65.48 | 61.73 | 65.48 | 63.55 |
| 0 | Multinomial NB | 96.30 | 60.71 | 75.38 | 60.71 | 66.10 |
| 3 | KNN | 91.52 | 41.67 | 64.91 | 41.67 | 50.34 |
| 5 | Decision Tree | 82.63 | 54.76 | 64.97 | 54.76 | 59.38 |
| 7 | Bagging | 65.43 | 22.62 | 66.91 | 22.62 | 29.20 |
| 6 | Random Forest | 52.92 | 51.19 | 74.59 | 51.19 | 58.65 |
| 8 | Ada Boost | 28.48 | 8.33 | 0.81 | 8.33 | 1.48 |
visualize_model_score(model_results_o, 'Train Accuracy')
visualize_model_score(model_results_o, 'Test Accuracy')
visualize_model_score(model_results_o, 'Precision')
visualize_model_score(model_results_o, 'Recall')
visualize_model_score(model_results_o, 'F1')
best_ml_model_bal = get_ml_model('SVM', X_train_o_vect, y_train_o)
I_input = I['Description'].iloc[10]
best_ml_model_bal.predict(vectorizer.transform([I_input]))
array([1], dtype=int64)
V_input = V['Description'].iloc[5]
best_ml_model_bal.predict(vectorizer.transform([V_input]))
array([5], dtype=int64)
III_input = III['Description'].iloc[5]
best_ml_model_bal.predict(vectorizer.transform([III_input]))
array([3], dtype=int64)
This time we have used the balanced (upsampled) dataset for preparing the model, and we can see that it predicts the minority classes correctly. Going forward we will be building all the models with the upsampled (balanced) dataset, i.e. X_train_o, y_train_o, X_test_o and y_test_o.
def reset_seeds(seed):
    np.random.seed(seed)
    python_random.seed(seed)
    set_seed(seed)
def get_basic_nn_model(X_train):
    reset_seeds(0)
    clear_session()
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],)))
    model.add(Activation('sigmoid'))
    model.add(Dense(24))
    model.add(Activation('sigmoid'))
    model.add(Dense(6))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
    return model
def get_nn_model_with_weight(X_train):
    reset_seeds(0)
    clear_session()
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],), kernel_initializer='he_normal'))
    model.add(Activation('sigmoid'))
    model.add(Dense(24, kernel_initializer='he_normal'))
    model.add(Activation('sigmoid'))
    model.add(Dense(6, kernel_initializer='he_normal'))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
    return model
def get_nn_model_with_relu(X_train):
    reset_seeds(0)
    clear_session()
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],), kernel_initializer='he_normal'))
    model.add(Activation('relu'))
    model.add(Dense(24))
    model.add(Activation('relu'))
    model.add(Dense(6))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
    return model
def get_nn_model_with_batch_normalization(X_train):
    reset_seeds(0)
    clear_session()
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],), kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dense(24))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dense(6))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
    return model
def get_nn_model_with_dropout(X_train):
    reset_seeds(0)
    clear_session()
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],), kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(Dense(24))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(Dense(6))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='SGD')
    return model
def get_nn_model_results(X_train, y_train, X_test, y_test):
    models = {
        'Basic NN Model': get_basic_nn_model(X_train),
        'NN Model with Weight Initialization': get_nn_model_with_weight(X_train),
        'NN Model with Relu Activation': get_nn_model_with_relu(X_train),
        'NN Model with Batch Normalization': get_nn_model_with_batch_normalization(X_train),
        'NN Model with Dropout': get_nn_model_with_dropout(X_train)
    }
    names = []
    train_scores = []
    test_scores = []
    precision_scores = []
    recall_scores = []
    f1_scores = []
    for name, model in models.items():
        model.fit(X_train, y_train, epochs=100, batch_size=8, validation_data=(X_test, y_test), verbose=False)
        train_loss, train_score = model.evaluate(X_train, y_train, verbose=0)
        test_loss, test_score = model.evaluate(X_test, y_test, verbose=0)
        y_pred = model.predict(X_test, batch_size=64, verbose=0)
        y_pred_bool = np.argmax(y_pred, axis=1)
        y_test_bool = np.argmax(y_test, axis=1)
        ps, rs, fs, _ = np.average(precision_recall_fscore_support(y_test_bool, y_pred_bool), axis=1)
        names.append(name)
        train_scores.append(round(train_score * 100, 2))
        test_scores.append(round(test_score * 100, 2))
        precision_scores.append(round(ps * 100, 2))
        recall_scores.append(round(rs * 100, 2))
        f1_scores.append(round(fs * 100, 2))
    results = pd.DataFrame({
        'Model': names,
        'Train Accuracy': train_scores,
        'Test Accuracy': test_scores,
        'Precision': precision_scores,
        'Recall': recall_scores,
        'F1': f1_scores
    })
    return results
def get_nn_model(model, X_train, y_train):
    models = {
        'Basic NN Model': get_basic_nn_model(X_train),
        'NN Model with Weight Initialization': get_nn_model_with_weight(X_train),
        'NN Model with Relu Activation': get_nn_model_with_relu(X_train),
        'NN Model with Batch Normalization': get_nn_model_with_batch_normalization(X_train),
        'NN Model with Dropout': get_nn_model_with_dropout(X_train)
    }
    # .get avoids a KeyError for an unknown model name
    selected_model = models.get(model)
    if selected_model is None:
        print('Model not available')
        return None
    selected_model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=False)
    return selected_model
y_train_o_cat = to_categorical(y_train_o, num_classes=None)
y_test_o_cat = to_categorical(y_test_o, num_classes=None)
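Note that with labels 1–5 and `num_classes=None`, `to_categorical` allocates `max(label) + 1 = 6` columns (column 0 is never used), which is why the output layers above have 6 units. A NumPy equivalent of that one-hot encoding:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])   # Accident Level I..V mapped to integers
n_cols = y.max() + 1            # to_categorical with num_classes=None does the same
y_cat = np.eye(n_cols)[y]       # one row per label, one-hot encoded
print(y_cat.shape)              # (5, 6); column 0 stays all zeros
```

Mapping the labels to 0–4 first would save the unused column and let the output layer shrink to 5 units.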
nn_model_results = get_nn_model_results(X_train_o_vect.todense(), y_train_o_cat, X_test_o_vect.todense(), y_test_o_cat)
nn_model_results.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
| | Model | Train Accuracy | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 2 | NN Model with Relu Activation | 99.26 | 77.38 | 35.80 | 21.89 | 21.86 |
| 3 | NN Model with Batch Normalization | 99.26 | 75.00 | 15.95 | 19.09 | 17.38 |
| 4 | NN Model with Dropout | 99.26 | 75.00 | 16.15 | 19.09 | 17.50 |
| 1 | NN Model with Weight Initialization | 48.81 | 67.86 | 17.31 | 36.97 | 19.21 |
| 0 | Basic NN Model | 40.00 | 78.57 | 15.71 | 20.00 | 17.60 |
visualize_model_score(nn_model_results, 'Train Accuracy')
visualize_model_score(nn_model_results, 'Test Accuracy')
visualize_model_score(nn_model_results, 'Precision')
visualize_model_score(nn_model_results, 'Recall')
visualize_model_score(nn_model_results, 'F1')
best_nn_model = get_nn_model('NN Model with Relu Activation', X_train_o_vect.todense(), y_train_o_cat)
np.argmax(best_nn_model.predict(vectorizer.transform([I['Description'].iloc[13]]).todense()))
1/1 [==============================] - 0s 41ms/step
1
np.argmax(best_nn_model.predict(vectorizer.transform([III['Description'].iloc[13]]).todense()))
1/1 [==============================] - 0s 18ms/step
3
np.argmax(best_nn_model.predict(vectorizer.transform([V['Description'].iloc[7]]).todense()))
1/1 [==============================] - 0s 18ms/step
5
best_nn_model.save('best_nn_model.h5')
loaded_model = load_model('best_nn_model.h5')
np.argmax(loaded_model.predict(vectorizer.transform(['Hand cut off']).todense()))
1/1 [==============================] - 0s 45ms/step
1
np.argmax(loaded_model.predict(vectorizer.transform([V['Description'].iloc[7]]).todense()))
1/1 [==============================] - 0s 17ms/step
5
The best performing model is predicting well.
def get_simple_lstm_model(max_len, top_words):
    clear_session()
    embedding_vector_length = 32
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
    model.add(LSTM(100))
    model.add(Dense(6, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
def get_lstm_with_dropout_model(max_len, top_words):
    clear_session()
    embedding_vector_length = 32
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
    model.add(Dropout(0.2))
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    model.add(Dense(6, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
def get_bidirectional_lstm_model(max_len, top_words):
    clear_session()
    embedding_vector_length = 32
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(100)))
    model.add(Dropout(0.2))
    model.add(Dense(6, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
def get_lstm_and_cnn_model(max_len, top_words):
    clear_session()
    embedding_vector_length = 32
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_len))
    model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(LSTM(100))
    model.add(Dense(6, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
def get_rnn_lstm_model_results(X_train, y_train, X_test, y_test, max_len, top_words):
    models = {
        'Simple LSTM Model': get_simple_lstm_model(max_len, top_words),
        'LSTM with Dropout': get_lstm_with_dropout_model(max_len, top_words),
        'Bidirectional LSTM': get_bidirectional_lstm_model(max_len, top_words),
        'LSTM and CNN': get_lstm_and_cnn_model(max_len, top_words)
    }
    names = []
    train_scores = []
    test_scores = []
    precision_scores = []
    recall_scores = []
    f1_scores = []
    for name, model in models.items():
        print('Preparing model ', name)
        model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64, verbose=False)
        train_loss, train_score = model.evaluate(X_train, y_train, verbose=0)
        test_loss, test_score = model.evaluate(X_test, y_test, verbose=0)
        y_pred = model.predict(X_test, batch_size=64, verbose=0)
        y_pred_bool = np.argmax(y_pred, axis=1)
        y_test_bool = np.argmax(y_test, axis=1)
        ps, rs, fs, _ = np.average(precision_recall_fscore_support(y_test_bool, y_pred_bool), axis=1)
        names.append(name)
        train_scores.append(round(train_score * 100, 2))
        test_scores.append(round(test_score * 100, 2))
        precision_scores.append(round(ps * 100, 2))
        recall_scores.append(round(rs * 100, 2))
        f1_scores.append(round(fs * 100, 2))
    results = pd.DataFrame({
        'Model': names,
        'Train Accuracy': train_scores,
        'Test Accuracy': test_scores,
        'Precision': precision_scores,
        'Recall': recall_scores,
        'F1': f1_scores
    })
    return results
def get_rnn_lstm_model(model, X_train, y_train, max_len, top_words):
    models = {
        'Simple LSTM Model': get_simple_lstm_model(max_len, top_words),
        'LSTM with Dropout': get_lstm_with_dropout_model(max_len, top_words),
        'Bidirectional LSTM': get_bidirectional_lstm_model(max_len, top_words),
        'LSTM and CNN': get_lstm_and_cnn_model(max_len, top_words)
    }
    # .get avoids a KeyError for an unknown model name
    selected_model = models.get(model)
    if selected_model is None:
        print('Model not available')
        return None
    selected_model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=False)
    return selected_model
def prepare_input(text):
    input_seq = tokenizer.texts_to_sequences([text])
    input_pad = pad_sequences(input_seq, maxlen=max_len)
    return input_pad
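`pad_sequences` left-pads (or truncates) every integer sequence to `maxlen` so all inputs share one shape, which `prepare_input` relies on. A minimal NumPy sketch of the default `padding='pre'` / `truncating='pre'` behaviour (toy sequences, hypothetical):

```python
import numpy as np

def pad_left(seq, maxlen, value=0):
    """Left-pad/truncate like keras pad_sequences(padding='pre')."""
    seq = list(seq)[-maxlen:]                  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq

# Hypothetical word-index sequences of unequal length
seqs = [[4, 7], [1, 2, 3, 4, 5]]
padded = np.array([pad_left(s, maxlen=4) for s in seqs])
print(padded)  # [[0 0 4 7], [2 3 4 5]]
```

Index 0 is reserved for padding, which is why the Tokenizer never assigns it to a real word.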
Here we re-use the already over-sampled train and test sets: X_train_o, X_test_o, y_train_o and y_test_o.
X = pd.concat([X_train_o, X_test_o])
top_words = 5000
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(X['description_processed_lemmatized'])
max_len = len(tokenizer.word_index)
max_len
2975
X_train_o_seq = tokenizer.texts_to_sequences(X_train_o['description_processed_lemmatized'])
X_test_o_seq = tokenizer.texts_to_sequences(X_test_o['description_processed_lemmatized'])
X_train_o_pad = pad_sequences(X_train_o_seq, maxlen=max_len)
X_test_o_pad = pad_sequences(X_test_o_seq, maxlen=max_len)
y_train_o_cat = to_categorical(y_train_o, num_classes=None)
y_test_o_cat = to_categorical(y_test_o, num_classes=None)
rnn_lstm_model_results = get_rnn_lstm_model_results(X_train_o_pad, y_train_o_cat, X_test_o_pad, y_test_o_cat, max_len, top_words)
Preparing model Simple LSTM Model
Preparing model LSTM with Dropout
Preparing model Bidirectional LSTM
Preparing model LSTM and CNN
rnn_lstm_model_results.sort_values(by=['Train Accuracy', 'Test Accuracy'], ascending=False)
| | Model | Train Accuracy | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 3 | LSTM and CNN | 99.26 | 73.81 | 29.12 | 23.18 | 24.17 |
| 1 | LSTM with Dropout | 99.18 | 59.52 | 24.95 | 21.74 | 21.88 |
| 2 | Bidirectional LSTM | 91.85 | 41.67 | 21.26 | 23.79 | 19.40 |
| 0 | Simple LSTM Model | 86.67 | 41.67 | 36.64 | 32.50 | 17.83 |
visualize_model_score(rnn_lstm_model_results, 'Train Accuracy')
visualize_model_score(rnn_lstm_model_results, 'Test Accuracy')
visualize_model_score(rnn_lstm_model_results, 'Precision')
visualize_model_score(rnn_lstm_model_results, 'Recall')
visualize_model_score(rnn_lstm_model_results, 'F1')
best_lstm_model = get_rnn_lstm_model('LSTM and CNN', X_train_o_pad, y_train_o_cat, max_len, top_words)
input_I = prepare_input(I['Description'].iloc[5])
np.argmax(best_lstm_model.predict(input_I))
1/1 [==============================] - 0s 76ms/step
1
input_III = prepare_input(III['Description'].iloc[2])
np.argmax(best_lstm_model.predict(input_III))
1/1 [==============================] - 0s 77ms/step
3
input_V = prepare_input(V['Description'].iloc[5])
np.argmax(best_lstm_model.predict(input_V))
1/1 [==============================] - 0s 73ms/step
5
best_lstm_model.save('best_lstm_model.h5')
I = df[df['Accident Level']=='I']
II = df[df['Accident Level']=='II']
III = df[df['Accident Level']=='III']
IV = df[df['Accident Level']=='IV']
V = df[df['Accident Level']=='V']
loaded_lstm_model = load_model('best_lstm_model.h5')
input_I = prepare_input(I['Description'].iloc[5])
np.argmax(loaded_lstm_model.predict(input_I))
1/1 [==============================] - 0s 402ms/step
1
input_III = prepare_input(III['Description'].iloc[2])
np.argmax(loaded_lstm_model.predict(input_III))
1/1 [==============================] - 0s 152ms/step
3
input_V = prepare_input(V['Description'].iloc[5])
np.argmax(loaded_lstm_model.predict(input_V))
1/1 [==============================] - 0s 147ms/step
5
The best performing model is predicting well.
all_model_scores = pd.concat([model_results_o, nn_model_results, rnn_lstm_model_results])
all_model_scores.sort_values(by=['Test Accuracy'], ascending=False)
| | Model | Train Accuracy | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 2 | Gaussian NB | 99.26 | 78.57 | 61.73 | 78.57 | 69.14 |
| 4 | SVM | 99.18 | 78.57 | 61.73 | 78.57 | 69.14 |
| 0 | Basic NN Model | 40.00 | 78.57 | 15.71 | 20.00 | 17.60 |
| 2 | NN Model with Relu Activation | 99.26 | 77.38 | 35.80 | 21.89 | 21.86 |
| 10 | RidgeClassifier | 99.18 | 76.19 | 67.42 | 76.19 | 70.18 |
| 1 | Logistic Regression | 99.18 | 76.19 | 62.08 | 76.19 | 68.42 |
| 4 | NN Model with Dropout | 99.26 | 75.00 | 16.15 | 19.09 | 17.50 |
| 3 | NN Model with Batch Normalization | 99.26 | 75.00 | 15.95 | 19.09 | 17.38 |
| 3 | LSTM and CNN | 99.26 | 73.81 | 29.12 | 23.18 | 24.17 |
| 1 | NN Model with Weight Initialization | 48.81 | 67.86 | 17.31 | 36.97 | 19.21 |
| 9 | Gradient Boosting | 99.09 | 65.48 | 61.73 | 65.48 | 63.55 |
| 0 | Multinomial NB | 96.30 | 60.71 | 75.38 | 60.71 | 66.10 |
| 1 | LSTM with Dropout | 99.18 | 59.52 | 24.95 | 21.74 | 21.88 |
| 5 | Decision Tree | 82.63 | 54.76 | 64.97 | 54.76 | 59.38 |
| 6 | Random Forest | 52.92 | 51.19 | 74.59 | 51.19 | 58.65 |
| 3 | KNN | 91.52 | 41.67 | 64.91 | 41.67 | 50.34 |
| 0 | Simple LSTM Model | 86.67 | 41.67 | 36.64 | 32.50 | 17.83 |
| 2 | Bidirectional LSTM | 91.85 | 41.67 | 21.26 | 23.79 | 19.40 |
| 7 | Bagging | 65.43 | 22.62 | 66.91 | 22.62 | 29.20 |
| 8 | Ada Boost | 28.48 | 8.33 | 0.81 | 8.33 | 1.48 |
visualize_model_score(all_model_scores, 'Train Accuracy')
visualize_model_score(all_model_scores, 'Test Accuracy')
Here we can see that the top performing models are Gaussian NB, SVM and the Basic NN Model.
All of these models have a Test Accuracy of 78.57%.
The next best performing model is NN Model with Relu Activation, with 77.38% accuracy.
best_model = get_ml_model('SVM', X_train_o_vect, y_train_o)
joblib.dump(best_model, 'best_model.pkl')
['best_model.pkl']
loaded_best_model = joblib.load('best_model.pkl')
II_input = II['Description'].iloc[10]
loaded_best_model.predict(vectorizer.transform([II_input]))
array([2], dtype=int64)
IV_input = IV['Description'].iloc[10]
loaded_best_model.predict(vectorizer.transform([IV_input]))
array([4], dtype=int64)